Exploratory Data Analysis of White Wine Quality by HINDHUJA GUTHA
Introduction:
The dataset used in this EDA is related to white wine samples of the Portuguese “Vinho Verde” wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009].
Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).
The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines.
Attribute Information:
Input variables (based on physicochemical tests):
- fixed acidity (tartaric acid - g / dm^3)
- volatile acidity (acetic acid - g / dm^3)
- citric acid (g / dm^3)
- residual sugar (g / dm^3)
- chlorides (sodium chloride - g / dm^3)
- free sulfur dioxide (mg / dm^3)
- total sulfur dioxide (mg / dm^3)
- density (g / cm^3)
- pH
- sulphates (potassium sulphate - g / dm^3)
- alcohol (% by volume)
Output variable (based on sensory data):
- quality (score between 0 and 10)
Univariate Plots Section
This section summarises all variables and the dataset as a whole, shows histograms for the important variables, and creates new variables where necessary.
White Wine Dataset Summary
Null values in Dataset
## [1] 0
row count
## [1] 4898
column count
## [1] 13
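The three checks above (null values, row count, column count) can be reproduced in a few lines. The following is only a sketch in Python using the standard library, not the code that produced this report. The function name `dataset_shape`, the treatment of empty cells and the string "NA" as missing, and the three sample rows are all assumptions made for illustration.

```python
import csv
import io


def dataset_shape(csv_text):
    """Return (row_count, column_count, missing_count) for CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)          # first line holds the column names
    rows = list(reader)
    # Assumption: an empty cell or the literal "NA" marks a missing value.
    missing = sum(
        1 for row in rows for cell in row if cell.strip() in ("", "NA")
    )
    return len(rows), len(header), missing


# Hypothetical rows for illustration only, not taken from the dataset.
sample = """X,pH,alcohol,quality
1,3.10,10.0,5
2,3.25,11.0,6
3,3.40,NA,7
"""

rows, cols, missing = dataset_shape(sample)
print(rows, cols, missing)
```

Run against the full file, the same function would be expected to report 4898 rows, 13 columns and 0 missing values, matching the output above.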
Dataset Summary
##        X        fixed.acidity    volatile.acidity  citric.acid
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00
##  Median : 5.200   Median :0.04300   Median : 34.00
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00
##  total.sulfur.dioxide    density             pH          sulphates
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800
##     alcohol         quality
##  Min.   : 8.00   Min.   :3.000
##  1st Qu.: 9.50   1st Qu.:5.000
##  Median :10.40   Median :6.000
##  Mean   :10.51   Mean   :5.878
##  3rd Qu.:11.40   3rd Qu.:6.000
##  Max.   :14.20   Max.   :9.000
Dataset Observations:
- The dataset has 13 columns. X is the index number of the sampled white wine.
- The other columns are attributes of sample X
- The range of volatile.acidity (0.08 to 1.10) is much narrower than that of fixed.acidity (3.8 to 14.2)
- Some samples contain no citric acid, which suggests it is an optional addition used for flavor.
- The maximum residual sugar (65.8) is much greater than the mean (6.391) and median (5.2)
- Free sulfur dioxide likewise has a maximum of 289 against a mean of 35.31
- Density is nearly constant across samples (0.9871 to 1.0390)
- pH ranges from 2.720 to 3.820
- Minimum alcohol is 8% whereas the maximum is 14.20%
- In general, average alcohol in wine is around 11-13% but it can vary from 5.5% to 20%, so the samples in the dataset have alcohol within the expected range.
- None of the samples is rated very bad (score 0) or very excellent (score 10)
- The lowest quality rating is 3 and the highest is 9.
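The observation that some maxima sit far above the mean and median can be checked with Tukey's 1.5 x IQR rule. This rule is brought in here as an illustration and is not something the original analysis states it used. The sketch below is in Python, the helper name `upper_fence` is hypothetical, and the quartiles and maxima are copied from the Dataset Summary above.

```python
def upper_fence(q1, q3, k=1.5):
    """Tukey's upper fence: Q3 + k * IQR, where IQR = Q3 - Q1."""
    return q3 + k * (q3 - q1)


# (1st Qu., 3rd Qu., Max.) copied from the Dataset Summary above.
summary = {
    "residual.sugar": (1.7, 9.9, 65.8),
    "free.sulfur.dioxide": (23.0, 46.0, 289.0),
    "total.sulfur.dioxide": (108.0, 167.0, 440.0),
}

for name, (q1, q3, maximum) in summary.items():
    fence = upper_fence(q1, q3)
    print(f"{name}: upper fence {fence:.1f}, max {maximum}, "
          f"beyond fence: {maximum > fence}")
```

For all three variables the maximum lies beyond the upper fence, which is consistent with the outliers noted in the univariate plots.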
idealpH categorical variable
Since 3-3.4 is reported to be the best pH range for white wines, a categorical variable idealpH is created. It takes the value ‘Yes’ when the pH is between 3 and 3.4, and ‘No’ otherwise.
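The rule for the new variable can be written as a small function. This is a sketch in Python and not the original code. The name `ideal_ph` is hypothetical, and treating both endpoints (3.0 and 3.4) as inside the range is an assumption, since the text does not say whether the bounds are inclusive.

```python
def ideal_ph(ph, low=3.0, high=3.4):
    """Return 'Yes' if ph lies in [low, high], else 'No'.

    Assumption: both bounds are inclusive.
    """
    return "Yes" if low <= ph <= high else "No"


# Median, minimum and maximum pH from the Dataset Summary.
for value in (3.18, 2.72, 3.82):
    print(value, ideal_ph(value))
```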
Fixed Acidity Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
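Each per-variable summary in this section reports the same six statistics. The sketch below shows one way to compute them in Python. The function name `six_number_summary` is hypothetical, the quartile method (`"inclusive"`, linear interpolation between observed values) is an assumption about how the quartiles were computed, and the sample values are made up for illustration.

```python
import statistics


def six_number_summary(values):
    """Min, quartiles, mean and max of a list of numbers."""
    # Assumption: quartiles use linear interpolation between data points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Min.": min(values),
        "1st Qu.": q1,
        "Median": median,
        "Mean": statistics.fmean(values),
        "3rd Qu.": q3,
        "Max.": max(values),
    }


# Hypothetical fixed.acidity values, not taken from the dataset.
print(six_number_summary([6.1, 6.4, 6.8, 7.2, 8.0]))
```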
Fixed Acidity plot

- fixed.acidity is approximately normally distributed
- Sample count decreases as fixed.acidity increases beyond about 6.75
Volatile Acidity Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
Volatile Acidity plot

- Volatile acidity has some outliers and its distribution is slightly right skewed
- The majority of samples have volatile acidity below 0.4
Citric Acid Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
Citric Acid plot

- citric.acid is approximately normally distributed, apart from some major outliers
Residual Sugar Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
Residual Sugar plot

- The distribution of residual sugar is right skewed
- Sample count falls as residual.sugar increases, which means the majority of samples have low residual sugar.
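A quick numeric cross-check for skew is to compare the mean with the median: a mean above the median hints at a right skew, a mean below it at a left skew. This is only a rough heuristic and not a formal skewness measure. The sketch below is in Python, the function name `skew_hint` is hypothetical, and the values are copied from the Dataset Summary.

```python
def skew_hint(mean, median):
    """Rough direction of skew from the mean/median comparison."""
    if mean > median:
        return "right"
    if mean < median:
        return "left"
    return "none"


# (Mean, Median) copied from the Dataset Summary.
summaries = {
    "residual.sugar": (6.391, 5.2),
    "volatile.acidity": (0.2782, 0.26),
    "alcohol": (10.51, 10.4),
    "quality": (5.878, 6.0),
}

for name, (mean, median) in summaries.items():
    print(f"{name}: {skew_hint(mean, median)}")
```

The histogram remains the primary evidence. The comparison only confirms the direction of the skew.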
Total sulfur dioxide Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
Total Sulfur dioxide plot

- Total sulfur dioxide is approximately normally distributed, with some outliers
- The majority of samples have total sulfur dioxide below 200
Density Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Density plot

- density is approximately normally distributed, with some major outliers
pH Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
pH plot

- pH is approximately normally distributed
- 3 to 3.3 seems to be the most common pH range
idealpH categorical variable (pH 3-3.4) Summary
## No Yes
## 834 4064
Bar plot for idealpH variable

- More than 80% of samples are in the ideal pH range
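The share follows directly from the two counts in the summary above. A short Python check, using only those counts (the variable names are hypothetical):

```python
# Counts copied from the idealpH summary above.
counts = {"No": 834, "Yes": 4064}

total = sum(counts.values())
share_yes = 100 * counts["Yes"] / total

print(f"{share_yes:.1f}% of {total} samples are in the ideal pH range")
```

The total matches the 4898 rows of the dataset, and the share works out to roughly 83%.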
Sulphates Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
Sulphates plot

- The sulphates distribution appears to be right skewed
Alcohol(% by volume) Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
Alcohol(% by volume) plot

- The distribution of alcohol is slightly right skewed
Quality Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
Quality plot

- The distribution of quality levels is slightly left skewed
- The majority of samples have quality levels 5, 6 & 7
Univariate Analysis
Number of Instances in white wine Dataset: 4898.
Number of Attributes: 13 columns in total. Column “X” identifies the sample and the remaining 12 columns are sample attributes
Missing Attribute Values: None
What is the structure of your dataset?
The dataset is tidy and there are no missing values.
What is/are the main feature(s) of interest in your dataset?
residual.sugar, alcohol, pH and fixed.acidity are the main attributes
What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
quality, sulphates and density can help in understanding more about the wine
Did you create any new variables from existing variables in the dataset?
I created the idealpH categorical variable based on the ideal pH range 3-3.4
Bivariate Plots Section
Based on the univariate analysis above, this section uses bivariate analysis to show comparisons and trends between two variables. Scatterplots are used because they are a good way to analyze bivariate relationships. Plots are analysed for the following pairs:
- fixed.acidity vs pH
- sulphates vs pH
- fixed.acidity vs sulphates
- total.sulfur.dioxide vs quality
- residual.sugar vs alcohol
- residual.sugar vs quality
- alcohol vs quality
- idealpH vs quality

- There is a negative association between fixed acidity and pH, though the correlation is not strictly linear.
- pH decreases as fixed acidity increases, which suggests that fixed.acidity is a major contributor to the overall pH of the sample

- There is a positive association between sulphates and pH, though it is not a linear correlation.
- pH increases as sulphates increase, which means sulphates is one of the major attributes in deciding the overall pH of the sample

- In the plot above, no association can be seen between fixed.acidity and sulphates. Further analysis is needed to decide whether one exists
- Very few samples have fixed acidity above 9
- Similarly, few samples have sulphates above 0.8

- The majority of wine samples have total sulfur dioxide above 100.
- Quality levels 5, 6 & 7 seem to have the most samples

- There is a negative association between residual sugar and alcohol %
- As residual sugar decreases, alcohol % increases and the sample count decreases

- Sample count seems to decrease as residual.sugar increases
- The majority of samples have residual.sugar below 20
- From quality level 5 upward, the overall sample count decreases as the quality level increases
- There is no direct association between residual sugar and quality

- Quality levels 5, 6 & 7 have the most samples
- The majority of samples have an alcohol level above 10%

- Samples in the ideal pH range (3-3.4) outnumber samples outside it, which suggests that quality may depend on the ideal pH range.
- This can be seen at all quality levels in the plot above
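Because the ideal pH group is much larger than the other group (4064 versus 834 samples), its raw counts are higher at every quality level almost by construction. Comparing the share of each quality level within each group is a fairer check. The sketch below is in Python, the function name `quality_shares` is hypothetical, and the records are made up for illustration and are not taken from the dataset.

```python
from collections import Counter


def quality_shares(records):
    """Map each idealpH group to {quality: share of that group}."""
    group_totals = Counter(group for group, _ in records)
    pair_counts = Counter(records)
    shares = {}
    for (group, quality), n in pair_counts.items():
        shares.setdefault(group, {})[quality] = n / group_totals[group]
    return shares


# Hypothetical (idealpH, quality) pairs for illustration only.
records = (
    [("Yes", 6)] * 6 + [("Yes", 5)] * 2
    + [("No", 6)] * 1 + [("No", 5)] * 1
)

for group, by_quality in quality_shares(records).items():
    print(group, by_quality)
```

If the within-group shares of the higher quality levels turn out similar for ‘Yes’ and ‘No’ on the real data, the difference in raw counts would mostly reflect group size.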
Bivariate Analysis
Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
- Negative association between fixed.acidity and pH
- Positive association between sulphates and pH
- The trend between fixed.acidity and sulphates rises and falls several times, with no consistent direction
- Most of the samples have residual sugar below 20 g / dm^3, with some outliers
- Alcohol and quality seem to have a positive correlation
- The majority of samples have an alcohol level above 10% and are in the ideal pH range
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
- Total sulfur dioxide is used to gauge the freshness of wine. The majority of samples have total sulfur dioxide above 100, which suggests that most of the wine samples are not well aged.
What was the strongest relationship you found?
- The strongest relationships are found between fixed.acidity and pH, and between sulphates and pH
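The relationships above are judged from scatterplots. One way to put a number on "strongest" is Pearson's correlation coefficient, which is introduced here as an illustration and was not reported in the original analysis. The sketch below is in Python, the function name `pearson_r` is hypothetical, and the sample values are made up and are not taken from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for illustration only.
fixed_acidity = [6.0, 6.5, 7.0, 7.5, 8.0]
ph = [3.35, 3.30, 3.20, 3.15, 3.05]

print(round(pearson_r(fixed_acidity, ph), 3))
```

A value near -1 or +1 indicates a strong linear association and a value near 0 a weak one. Computing it for each pair on the real columns would allow the pairs to be ranked.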
Multivariate Plots Section
This section explores associations between multiple variables. Based on the bivariate plot analysis, the following variables are analyzed together.
- fixed.acidity, sulphates and pH
- residual.sugar, alcohol and quality
Relationship between Fixed Acidity, Sulphates and pH

- From the plot above we can deduce that pH increases slightly as sulphates increase
- Also, pH decreases as fixed.acidity increases
- There seems to be a negative association between fixed.acidity and pH
- The majority of samples have fixed acidity above 6 and sulphates below 0.6
Relationship between Alcohol, Residual Sugar and Quality

- The plot above confirms that alcohol % also contributes to quality.
- Samples with alcohol % above 11 have higher quality levels, and vice versa
- The majority of samples have residual.sugar below 10
- As residual.sugar increases, higher quality levels become less frequent, which suggests residual.sugar affects quality to some extent
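The claim about the 11% alcohol threshold can be checked by comparing mean quality on each side of it. The sketch below is in Python. The function name `mean_quality_by_band` is hypothetical, and the (alcohol, quality) pairs are made up for illustration and are not taken from the dataset.

```python
from statistics import fmean


def mean_quality_by_band(samples, threshold=11.0):
    """Mean quality for samples above vs. at or below an alcohol threshold."""
    above = [quality for alcohol, quality in samples if alcohol > threshold]
    below = [quality for alcohol, quality in samples if alcohol <= threshold]
    return {
        "above": fmean(above) if above else None,
        "at_or_below": fmean(below) if below else None,
    }


# Hypothetical (alcohol %, quality) pairs for illustration only.
samples = [(9.0, 5), (9.5, 5), (10.2, 6), (11.4, 6), (12.1, 7), (12.8, 7)]

print(mean_quality_by_band(samples))
```

On the real data, a clearly higher mean in the "above" band would support the observation read off the plot.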
Multivariate Analysis
Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
- The majority of samples have fixed.acidity above 6.
- The distribution of pH with respect to fixed.acidity and sulphates is approximately normal, with some outliers. This is consistent with the bivariate analysis of pH & sulphates and of pH & fixed.acidity
Were there any interesting or surprising interactions between features?
- There is an inverse correlation between fixed.acidity and pH
- Higher quality samples are seen when alcohol is above 11%
Final Plots and Summary
Below are the 3 plots with the most interesting findings
Plot One

Description One
The ideal pH range for white wine is 3-3.4. The plot above shows that more than 80% of samples are in the ideal pH range.
Plot Two

Description Two
- The plot above shows that far more samples at quality levels 5, 6 and 7 are in the ideal pH range than outside it
- Quality shows some positive association with the ideal pH range, which supports 3-3.4 being the ideal pH range
Plot Three

Description Three
- The plot above confirms that alcohol % also contributes to quality.
- Samples with alcohol % above 11 have higher quality levels, and vice versa
- The majority of samples have residual.sugar below 10
- As residual.sugar increases, higher quality levels become less frequent, which suggests residual.sugar affects quality to some extent
Reflection
- This is a very tidy dataset, which made the univariate analysis easy to perform.
- For the bivariate analysis I could not initially figure out the main attributes and the supporting attributes, which resulted in some rework.
- After some research on white wine I was able to determine them, which suggests that prior knowledge of the dataset's attributes is required to make a solid analysis.
- Some important attributes of wine, such as age, tannins and grape types, are not included in the dataset. They would have helped in understanding more about quality.
- Data on these attributes would help in understanding more about wine and in finding other main attributes that contribute to quality.